Automatic speech recognition system development in the “wild”
The standard framework for developing an automatic speech recognition (ASR) system is to generate training and development data for building the system, and evaluation data for the final performance analysis. All the data is assumed to come from the domain of interest. Though this framework is well matched to some tasks, it is more challenging for systems that are required to operate over broad domains, or where the ability to collect the required data is limited. This paper discusses ASR work performed under the IARPA MATERIAL program, which is aimed at cross-language information retrieval, and examines this challenging scenario. In terms of available data, only limited narrow-band conversational telephone speech data was provided. However, the system is required to operate over a range of domains, including broadcast data. As no data is available for the broadcast domain, this paper proposes an approach for system development based on scraping "related" data from the web, and using ASR system confidence scores as the primary metric for developing the acoustic and language model components. As an initial evaluation of the approach, the Swahili development language is used, with the final system performance assessed on the IARPA MATERIAL Analysis Pack 1 data. This work was supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the Air Force Research Laboratory (AFRL).
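The confidence-driven development loop described above can be caricatured in a few lines. This is a hypothetical sketch: the build names and scores are invented, and a real system would average lattice-based word confidences over held-out untranscribed target-domain audio rather than use hand-set numbers.

```python
# Hypothetical sketch: ranking candidate acoustic/language model builds by the
# mean utterance-level ASR confidence they assign to untranscribed
# target-domain audio, used as a stand-in for WER when no reference
# transcripts exist.

def mean_confidence(utterance_confidences):
    """Average per-utterance confidence scores for one system build."""
    return sum(utterance_confidences) / len(utterance_confidences)

def rank_builds(builds):
    """builds: dict mapping build name -> list of per-utterance confidences.
    Returns build names sorted from most to least confident."""
    return sorted(builds, key=lambda name: mean_confidence(builds[name]),
                  reverse=True)

# Invented example: a baseline versus a build whose LM was augmented with
# scraped web text.
builds = {
    "baseline": [0.61, 0.58, 0.64],
    "plus_web_lm": [0.72, 0.69, 0.75],
}
print(rank_builds(builds))  # the web-augmented build ranks first here
```

The design choice being illustrated is simply that, absent references, system confidence on in-domain audio is treated as the selection signal for both acoustic and language model development.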
Confidence Estimation for Black Box Automatic Speech Recognition Systems Using Lattice Recurrent Neural Networks
Recently, there has been growth in providers of speech transcription services, enabling others to leverage technology they would not normally be able to use. As a result, speech-enabled solutions have become commonplace. Their success critically relies on the quality, accuracy, and reliability of the underlying speech transcription systems. Those black-box systems, however, offer limited means for quality control, as only word sequences are typically available. This paper examines this limited-resource scenario for confidence estimation, a measure commonly used to assess transcription reliability. In particular, it explores what other sources of word- and sub-word-level information available in the transcription process could be used to improve confidence scores. To encode all such information, this paper extends lattice recurrent neural networks to handle sub-words. Experimental results using the IARPA OpenKWS 2016 evaluation system show that the use of additional information yields significant gains in confidence estimation accuracy. The implementation for this model can be found online. Comment: 5 pages, 8 figures, ICASSP submission.
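A minimal sketch may help fix intuitions about the lattice inputs involved. The toy lattice and the posterior computation below are hypothetical and heavily simplified (parallel arcs between the same node pair only, not a general forward-backward pass); the paper's lattice recurrent neural networks consume such lattices, enriched with sub-word features, to produce improved confidence estimates.

```python
import math

# Hypothetical toy lattice as arcs: (from_node, to_node, word, log_likelihood).
arcs = [
    (0, 1, "the", math.log(0.6)),
    (0, 1, "a",   math.log(0.4)),
    (1, 2, "cat", math.log(1.0)),
]

def arc_posteriors(arcs):
    """Posterior of each arc as its likelihood share among competing arcs
    between the same node pair (a simplification valid for this topology)."""
    totals = {}
    for f, t, _, ll in arcs:
        totals[(f, t)] = totals.get((f, t), 0.0) + math.exp(ll)
    return {(f, t, w): math.exp(ll) / totals[(f, t)]
            for f, t, w, ll in arcs}

post = arc_posteriors(arcs)
print(round(post[(0, 1, "the")], 2))  # 0.6
```

Such arc posteriors are the classic lattice-based confidence baseline that richer word and sub-word features aim to improve on.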
Unicode-based graphemic systems for limited resource languages
© 2015 IEEE. Large vocabulary continuous speech recognition systems require a mapping from words, or tokens, into sub-word units to enable robust estimation of acoustic model parameters, and to model words not seen in the training data. The standard approach to achieve this is to manually generate a lexicon where words are mapped into phones, often with attributes associated with each of these phones. Context-dependent acoustic models are then constructed using decision trees where questions are asked based on the phones and phone attributes. For low-resource languages, it may not be practical to manually generate a lexicon. An alternative approach is to use a graphemic lexicon, where the 'pronunciation' for a word is defined by the letters forming that word. This paper proposes a simple approach for building graphemic systems for any language written in Unicode. The attributes for graphemes are automatically derived using features from the Unicode character descriptions. These attributes are then used in decision tree construction. This approach is examined on the IARPA Babel Option Period 2 languages, and a Levantine Arabic CTS task. The described approach achieves comparable, and complementary, performance to phonetic lexicon-based approaches.
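The idea of deriving grapheme attributes from Unicode character descriptions can be sketched directly with Python's standard `unicodedata` module. The particular attributes chosen here (general category, a vowel flag from the character name, combining status) are illustrative assumptions, not the paper's exact feature set:

```python
import unicodedata

def grapheme_attributes(ch):
    """Derive illustrative attributes for a grapheme from its Unicode
    character description, for use as decision-tree questions."""
    name = unicodedata.name(ch, "")
    return {
        "category": unicodedata.category(ch),  # e.g. 'Lo' = letter, other
        "is_vowel_name": "VOWEL" in name,      # many vowel signs are named so
        "is_combining": unicodedata.combining(ch) != 0,
    }

# Devanagari letter KA versus the dependent vowel sign AA
print(grapheme_attributes("\u0915"))
print(grapheme_attributes("\u093E"))
```

Because these attributes come from the Unicode database rather than a hand-built lexicon, the same derivation applies unchanged to any language written in Unicode.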
A language space representation for speech recognition
© 2015 IEEE. The number of languages for which speech recognition systems have become available is growing each year. This paper proposes to view languages as points in some rich space, termed language space, where the bases are eigen-languages and a particular choice of projection determines each language's point. Such an approach could not only reduce development costs for each new language but also provide automatic means for language analysis. For the initial proof of the concept, this paper adopts cluster adaptive training (CAT), known for inducing similar spaces for speaker adaptation needs. The CAT approach used in this paper builds on the previous work for language adaptation in speech synthesis and extends it to Gaussian mixture modelling more appropriate for speech recognition. Experiments conducted on IARPA Babel program languages show that such language space representations can outperform language-independent models and discover closely related languages in an automatic way.
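The core CAT interpolation can be sketched in a few lines of NumPy. This is a hypothetical, stripped-down illustration: the basis matrix and weights are random placeholders, and a real CAT system interpolates per Gaussian component with estimated cluster means rather than a single matrix.

```python
import numpy as np

# Hypothetical sketch of CAT-style interpolation: each eigen-language
# contributes a basis mean vector (a column of M), and a language is a point
# lam in that space; its adapted Gaussian mean is the combination M @ lam.
rng = np.random.default_rng(0)
n_dim, n_bases = 4, 3
M = rng.standard_normal((n_dim, n_bases))  # columns: eigen-language means

lam = np.array([0.7, 0.2, 0.1])  # language-specific interpolation weights
mu = M @ lam                     # adapted mean for this language

print(mu.shape)  # (4,)
```

In this view, estimating a new language reduces to estimating the low-dimensional weight vector `lam`, which is why such spaces can cut per-language development cost and why nearby points suggest related languages.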
Low-resource speech recognition and keyword-spotting
© Springer International Publishing AG 2017. The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real-world data. This paper will describe some of the developments in speech recognition and keyword-spotting during the lifetime of the project. Two technical areas will be briefly discussed, with a focus on techniques developed at Cambridge University: the application of deep learning for low-resource speech recognition; and efficient approaches for keyword spotting. Finally, a brief analysis of the Babel speech language characteristics and language performance will be presented.
Hyperspectral imaging to measure apricot attributes during storage
The fruit industry needs rapid and non-destructive techniques to evaluate the quality of the products in the field and during the post-harvest phase. The soluble solids content (SSC), in terms of °Brix, and the flesh firmness (FF) are typical parameters used to measure fruit quality and maturity state. Hyperspectral imaging (HSI) is a powerful technique that combines image analysis and infrared spectroscopy. This study aimed to evaluate the potential of applying Vis/NIR push-broom hyperspectral imaging (400 to 1000 nm) to predict the firmness and the °Brix in apricots (180 samples) during storage (11 days). Partial least squares (PLS) and artificial neural networks (ANN) were used to develop predictive models. For the PLS, R2 values (test set) up to 0.85 (RMSEP=1.64 N) and 0.72 (RMSEP=0.51 °Brix) were obtained for the FF and SSC, respectively. Concerning the ANN, the best results in the test set were achieved for the FF (R2=0.85, RMSEP=1.50 N). The study showed the potential of the HSI technique as a non-destructive tool for measuring apricot quality along the whole supply chain.
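The test-set figures of merit quoted above, R2 and RMSEP, are easy to compute once a calibration model has produced predictions. The toy firmness values below are invented for illustration; only the metric definitions reflect what is reported in such studies.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction on the test set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination on the test set."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented example: reference firmness (N) versus model predictions.
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [10.5, 11.5, 14.5, 15.5]
print(round(rmsep(y_true, y_pred), 3), round(r2(y_true, y_pred), 3))  # 0.5 0.95
```

RMSEP carries the units of the measured attribute (N for firmness, °Brix for SSC), which is why the study reports both metrics together.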
Incorporating uncertainty into deep learning for spoken language assessment
There is a growing demand for automatic assessment of spoken English proficiency. These systems need to handle large variations in input data owing to the wide range of candidate skill levels and L1s, and errors from ASR. Some candidates will be a poor match to the training data set, undermining the validity of the predicted grade. For high-stakes tests it is essential for such systems not only to grade well, but also to provide a measure of their uncertainty in their predictions, enabling rejection to human graders. Previous work examined Gaussian Process (GP) graders which, though successful, do not scale well with large data sets. Deep Neural Networks (DNN) may also be used to provide uncertainty using Monte-Carlo Dropout (MCD). This paper proposes a novel method to yield uncertainty and compares it to GPs and DNNs with MCD. The proposed approach explicitly teaches a DNN to have low uncertainty on training data and high uncertainty on generated artificial data. In experiments conducted on data from the Business Language Testing Service (BULATS), the proposed approach is found to outperform GPs and DNNs with MCD in uncertainty-based rejection whilst achieving comparable grading performance.
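The MCD baseline mentioned above can be sketched compactly. This is a hypothetical, minimal stand-in: a single linear layer with input dropout replaces a real grading DNN, and the spread of repeated stochastic passes serves as the uncertainty used for rejection.

```python
import numpy as np

rng = np.random.default_rng(1)

def forward_with_dropout(x, W, p=0.5):
    """One stochastic pass of a single linear layer with dropout kept active
    at test time (inverted dropout scaling)."""
    mask = rng.random(x.shape) >= p
    return float((x * mask / (1.0 - p)) @ W)

def mcd_predict(x, W, T=100):
    """Monte-Carlo Dropout: predictive mean and std over T dropout samples."""
    samples = np.array([forward_with_dropout(x, W) for _ in range(T)])
    return samples.mean(), samples.std()

# Invented toy features and weights standing in for a trained grader.
x = np.array([1.0, 2.0, -1.0, 0.5])
W = np.array([0.3, -0.2, 0.1, 0.4])
mean, std = mcd_predict(x, W)
print(std > 0)  # nonzero spread is the uncertainty signal for rejection
```

A rejection policy then routes candidates whose `std` exceeds a threshold to human graders; the paper's proposed method instead trains the network directly to separate in-domain from artificial inputs by their uncertainty.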
Inflectional loci of scrolls
Let $X \subset \mathbb{P}^N$ be a scroll over a smooth curve and let $L = \mathcal{O}_{\mathbb{P}^N}(1)|_X$ denote the hyperplane bundle. The special geometry of $X$ implies that some sheaves related to the principal part bundles of $L$ are locally free. The inflectional loci of $X$ can be expressed in terms of these sheaves, leading to explicit formulas for the cohomology classes of the loci. The formulas imply that the only uninflected scrolls are the balanced rational normal scrolls. Comment: 9 pages, improved version. Accepted in Mathematische Zeitschrift.